1. Background of the Invention

The clock is a very important component of a chip and impacts all aspects of the design. Differences in clock
signal arrival times (the maximum difference is called clock skew) and any variation in clock skew or clock period
(this variation is called jitter) decrease performance (operating frequency) and may lead to functional failure
(setup or hold time violations). The following are the most significant factors that either create a constant
difference in clock signal arrivals (clock skew S) or may dynamically change S:

F1. Imperfect clock tree synthesis and a resulting clock tree with non-zero clock skew S (the ideal tree
should be balanced in all path delays and have zero skew).
F2. Crosstalk. Transitions in signal nets dynamically and stochastically impact clock net delays. This
impact results in an additional (positive or negative) crosstalk incremental delay in the clock nets that
increases clock skew S (and clock jitter).
F3. Variations in PVT (process, voltage, and temperature) parameters. PVT parameters (or
conditions such as power dissipation, temperature, transistor sizes, wire width, layer thickness,
gradients in doping, local hot spots, voltage drops, etc.) are not constant across different areas of a chip
and impact cell timing. As with factor F2, some PVT parameters (such as voltage and temperature)
change dynamically and stochastically in space and time, and are unpredictable.
F4. Chip functionality. Current clock synthesis tools do not understand design functionality in the sense of
data transfer from one flip-flop to another. This means that these tools ignore information about the timing
criticality of paths between flip-flops, which must meet setup or hold time requirements. This results in
decreased performance or in more timing violations that have to be fixed later.

These four factors impact the clock signal arrivals at the flip-flops (FFs) of timing critical paths and decrease
performance. The decrease becomes even more significant for higher performance designs (smaller clock
periods). These factors can also lead to multiple timing violations (setup and hold time limits). Addressing these
problems requires an iterative process of fixing and optimization, which is very time consuming.
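
To make the definitions of skew and jitter above concrete, the following minimal Python sketch computes clock skew as the largest difference between clock arrival times, and estimates jitter as the spread of that skew over several observed cycles. The function names and the arrival times (in picoseconds) are hypothetical and serve only as an illustration.

```python
def clock_skew(arrivals):
    """Largest difference between clock arrival times at the flip-flops."""
    times = list(arrivals.values())
    return max(times) - min(times)


def skew_jitter(arrivals_per_cycle):
    """Spread of the skew over several observed clock cycles."""
    skews = [clock_skew(a) for a in arrivals_per_cycle]
    return max(skews) - min(skews)


# Hypothetical arrival times (ps) at two flip-flops over three clock cycles.
cycles = [
    {"FF_1": 1000, "FF_2": 1200},
    {"FF_1": 1000, "FF_2": 1250},
    {"FF_1": 1050, "FF_2": 1200},
]
print([clock_skew(c) for c in cycles])
print(skew_jitter(cycles))
```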

A current design flow includes the following main steps: 

- Cell placement (usually time-driven placement). After this step all cells, including FFs, are fixed.
- Clock implementation (synthesis) to find the clock tree with minimum clock skew (or so-called useful
clock skew).
- Clock protection (by using additional spacing or shielding around clock nets).
- Crosstalk and static timing analysis (STA) that shows whether the design meets timing (has timing closure).
- If there is no timing closure, then the cell placement and the clock implementation must be fixed. This is
a very time consuming process, with iterations through many design stages.

Current clock synthesis tries to optimize only clock skew S (it considers only factor F1) and does not
address factors F2-F4. Moreover, we will show that if cell placement is accomplished without taking into
account the subsequent clock implementation step, then it is likely that no clock tree exists that
is good in terms of all factors F1-F4.

We developed a new approach in which an initial (partial) clock synthesis is included in cell placement. It drives
cell placement toward a solution that is optimal for clock synthesis: it finds a cell placement for which it becomes
easy to satisfy the requirements of all factors F1-F4. Some of the developed ideas are implemented in several
optimization scripts and prototypes. The intention is to implement all these features in the lsimrs tool for the
FS3.5 release.

We believe that this invention disclosure is very important because of the following criteria:

1. Novelty of the invention. It is a significant improvement over the existing state of the art. It is a new
approach with many benefits.
2. This invention will be in LSI internal use starting with the FS3.5 release. It will be implemented in lsimrs
and the new design flow. A product incorporating the invention has a long technology life cycle of
commercial use (more than 5 years), so LSI will be able to obtain many benefits from the patent.
3. This invention is valuable because there are not many ways or alternate technologies to "design
around" the invention. Moreover, any alternate design would add a significant cost.
4. It is relatively easy to detect that a competitor is using this particular invention. It can be detected just
by looking at the placement of FFs. Thus, it is important to get patent protection for LSI.

2. Description of the Problem

Fig. 1 shows a simplified flowchart for the current clock synthesis. 



[Fig. 1 flowchart: Time-driven cell placement -> Minimum-clock-skew clock synthesis -> Clock protection ->
Crosstalk and static timing analysis]

Fig. 1. Current clock synthesis simplified flow



Fig. 2 shows one path P between Flip-flop_1 (FF_1) and Flip-flop_2 (FF_2). Any path contains a chain of logic
cells (combinational logic). The propagation delay D(P) along P is the sum of the cell delays DC(P) (practically a
constant) and the signal (wire) delays DW(P) along P. The wire delays DW(P) grow with the sum L(P) of wire lengths
along P. Usually a path P has a large delay D(P) if it contains many cells (in this case DC(P) is large) and
the total wire length L(P) is large.

[Fig. 2 drawing: Flip-flop_1 and Flip-flop_2, each with a CLK input, placed on the chip and connected by
path P (combinational logic)]

Fig. 2. Two Flip-flops placed on the chip and a path P between them
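
The delay decomposition D(P) = DC(P) + DW(P) described for path P can be sketched as follows. This is only an illustration under a loud assumption: wire delay is modeled as proportional to wire length, with a coefficient wire_ps_per_um that is hypothetical and not taken from any technology library. The Stage type and all numbers are likewise invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    cell_delay_ps: float   # delay of one logic cell on the path
    wire_length_um: float  # length of the wire driven by that cell


def path_delay(stages, wire_ps_per_um):
    """Return (D, DC, DW, L) for a path given as a list of stages."""
    dc = sum(s.cell_delay_ps for s in stages)        # DC(P): sum of cell delays
    length = sum(s.wire_length_um for s in stages)   # L(P): total wire length
    dw = wire_ps_per_um * length                     # ASSUMPTION: DW(P) linear in L(P)
    return dc + dw, dc, dw, length


# Hypothetical two-cell path.
stages = [Stage(50, 200), Stage(60, 300)]
d, dc, dw, length = path_delay(stages, wire_ps_per_um=0.5)
print(d, dc, dw, length)
```

With the same cells, a longer total wire length L(P) gives a larger D(P), which is why time-driven placement tries to keep L(P) small on critical paths.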



To simplify the following discussion, let us use the simpler graphical representation of path P shown in Fig. 3.

[Fig. 3 drawing: FF_1 and FF_2 on the chip, connected by path P]

Fig. 3. Two Flip-flops FF_1 and FF_2 placed on the chip and path P (simplified representation)



Current cell placement (Fig. 1) is performed to minimize path delays D(P), because the maximum D(P) must be
less than the clock period and thus defines the design performance. This means that the current cell placement is
time-driven. Let us assume that path P in Fig. 3 is the result of such a cell placement, with D(P)=Do and L(P)=Lo,
where Lo is comparable to the chip size, and that this placement meets timing. During the next step in Fig. 1
(clock synthesis), a clock tree is created (inserted). See Fig. 4.

[Fig. 4 drawing: FF_1 and FF_2 on the chip with L(P)=Lo, and a clock tree from the clock input]

Fig. 4. Clock tree with several clock buffers and clock nets is inserted.

The clock tree in Fig. 4 has two long, different branches (1 and 2) to FF_1 and FF_2, because FF_1 and FF_2 are
placed far away from each other at distance L(P). We will now show that this kind of placement (which is
optimal in the sense of minimum path delay D(P)) is not good for building a good clock tree (in the sense of
factors F1-F4). Let us consider each factor in turn.

Factor 1: This clock tree will likely have a large clock skew, because FF_1 and FF_2 are located far away from
each other; clock branches 1 and 2 have different routing (and possibly different clock buffers), and it is
difficult to balance this clock tree.

Factor 2: Clock branches 1 and 2 contain different clock nets, and these nets have different aggressor nets.
Thus, the crosstalk-injected delays in branches 1 and 2 may be significantly different: one can speed up the clock
signal while the other slows it down. This will increase clock skew and introduce clock jitter.

Factor 3: Clock branches 1 and 2 go through completely different chip areas. Thus, PVT conditions in branches 1
and 2 will be different (they vary across the die). Therefore, the ends of branches 1 and 2 will show much
larger delay differences over PVT than two ends that share the same branch. This delay difference
will also increase clock skew.

Factor 4: If delay D(P) is small, then the delay difference between branches 1 and 2 should be minimal (zero clock
skew). If delay D(P) is large (and some other conditions hold that will be discussed later), it may be useful to
have more delay in branch 2 than in branch 1. Because current clock synthesis does not see design functionality, it
does not distinguish between branches 1 and 2. Also, clock branches 1 and 2 go through completely different chip
areas, and it is difficult (as already discussed above) to do any precise clock delay tuning.
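
The reasoning behind Factor 4 can be illustrated with the textbook setup and hold relations for a single path between a launching and a capturing flip-flop. The sketch below is not a description of any tool: the function names are hypothetical, skew is taken as the clock arrival at the capturing flip-flop minus the arrival at the launching flip-flop, and the clock-to-output delay of the launching flip-flop is assumed to be folded into the path delay.

```python
def setup_slack(period, path_delay, t_setup, skew):
    """Setup slack; skew = capture clock arrival minus launch clock arrival."""
    return (period + skew) - (path_delay + t_setup)


def hold_slack(path_delay, t_hold, skew):
    """Hold slack for the same skew convention (uses the path's delay)."""
    return path_delay - (t_hold + skew)


# Hypothetical numbers in ps: a long path fails setup with zero skew ...
print(setup_slack(period=1000, path_delay=1050, t_setup=50, skew=0))
# ... and passes when the capturing flip-flop is clocked 150 ps later.
print(setup_slack(period=1000, path_delay=1050, t_setup=50, skew=150))
# The same 150 ps skew breaks hold on a short path.
print(hold_slack(path_delay=100, t_hold=30, skew=150))
```

This shows why a short path wants close to zero skew, while a long path can benefit from extra delay on the capturing side.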

We showed that the current clock synthesis and flow lead to a large clock skew and jitter. Also, factors F1-F4
have to be taken into account during timing closure of the design by applying more pessimistic delays on
paths (an STA tool such as PrimeTime has to add additional uncertainty). This makes timing closure more difficult
to achieve and leads to more design iterations. If we do not account for these effects during timing closure, then
timing in these paths can fail in silicon (crosstalk effects and PVT variations become more significant in smaller
technologies and more complex designs).

The situation becomes even more complex (in terms of finding an optimal clock tree) if we consider real designs,
where paths can be "connected" through common flip-flops. See the example in Fig. 5 with two connected paths
P(FF_1, FF_2) and P(FF_2, FF_3). In the general case there may be any number of connected paths. A traditional
clock tree for this example is shown in Fig. 6; it is not good in the sense of factors F1-F4 for the same reasons
already discussed for the example in Fig. 4.

[Fig. 5 drawing: FF_1, FF_2, and FF_3 on the chip, connected by combinational-logic paths FF_1-FF_2 and
FF_2-FF_3]

Fig. 5. Two connected paths (FF_1-FF_2, FF_2-FF_3)

Fig. 6. Two connected paths and traditional clock tree.

To meet timing closure we have to consider not all paths, but only the timing critical ones. A path is timing
critical for the setup requirement if the propagation delay along the path is more than some fraction (such as 90%)
of the clock period. Usually path P is setup critical if it contains many cells (such as 40-70; in this case DC(P)
is large) and the average wire length between cells in P is more than some small value (such as 200 microns), or
if the path has an average number of cells (about 2-30) but the average wire length between cells in P is more
than some large value (such as 500 microns). A path is timing critical for the hold requirement if the propagation
delay along the path is less than some fraction (such as 10%) of the clock period. Usually path P is hold critical
if it contains few or no cells (such as 0-5; in this case DC(P) is very small) and the average wire length between
cells in P is less than some small value (such as 200 microns).
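
A minimal sketch of the criticality test described above, using the example fractions from the text (90% of the clock period for setup, 10% for hold). The function name and the default thresholds are illustrative assumptions; a real flow would take these decisions from static timing analysis.

```python
def classify_path(delay_ps, period_ps, setup_fraction=0.9, hold_fraction=0.1):
    """Classify a path by comparing its delay with fractions of the clock period."""
    if delay_ps > setup_fraction * period_ps:
        return "setup-critical"
    if delay_ps < hold_fraction * period_ps:
        return "hold-critical"
    return "non-critical"


# Hypothetical path delays for a 1000 ps clock period.
for delay in (950, 500, 50):
    print(delay, classify_path(delay, period_ps=1000))
```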

If we consider only timing critical paths, then these paths form several groups (a partition of all critical
paths) such that:

- Each critical path belongs to some group;

- Each group contains only connected critical paths; and 

- If two critical paths are connected they belong to the same group. 

We establish that the current clock synthesis and cell placement do not recognize, analyze, use, or take into
account the path partition. The clock tree that results from such a flow will not take factors F1-F4 into account
and will not be optimal.
3. Proposal

Now, let us consider another cell placement with the same length Lo for L(P) as in Fig. 4, and hence the same
D(P)=Do. See Fig. 7.




[Fig. 7 drawing: FF_1 and FF_2 placed close to each other on the chip, with path P of the same length]

Fig. 7. Another placement of FF_1 and FF_2 with the same D(P)

As one can see, the main difference is that flip-flops FF_1 and FF_2 are placed close to each other. Thus, we
can now insert a clock buffer and place it close to both FF_1 and FF_2. See Fig. 8.

[Fig. 8 drawing: a clock buffer placed next to FF_1 and FF_2 on the chip]

Fig. 8. Optimal placement of FF_1 and FF_2 with clock sub-net.



The clock tree in Fig. 8 has two short connections (1 and 2) to FF_1 and FF_2 that constitute a clock sub-net,
because FF_1 and FF_2 are placed close to each other. We will now show that this cell placement (which is
also optimal in the sense of minimum path delay D(P)) and clock sub-tree are optimal (in the sense of factors
F1-F4) for building the optimal clock tree. To build the optimal clock tree we can apply the current clock
synthesis to the sub-tree already created in Fig. 8. See the result in Fig. 9.

[Fig. 9 drawing: clock input connected to the clock sub-net of FF_1 and FF_2 on the chip]

Fig. 9. Optimal placement of FF_1 and FF_2 and clock tree.



Let us show that this clock tree is optimal (in the sense of factors F1-F4). Let us consider each factor in turn.

Note: The clock tree has two parts: clock branch 1 and the sub-tree. Branch 1 can be excluded from
consideration (for all factors F1-F4), because it is a common part of both clock paths from the clock input to
FF_1 and FF_2. Any delay, crosstalk-injected delay, or PVT-caused delay on it will be the same for FF_1 and FF_2.
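
The note above can be checked with a small numerical sketch: when both clock paths share branch 1, any change in the delay of that branch cancels out of the skew between FF_1 and FF_2. The segment names and delay values (in ps) are hypothetical.

```python
def arrival(segments, delays):
    """Clock arrival time as the sum of segment delays along one clock path."""
    return sum(delays[s] for s in segments)


def pair_skew(path_a, path_b, delays):
    """Skew between two flip-flops given their clock paths from the clock input."""
    return arrival(path_b, delays) - arrival(path_a, delays)


# Both clock paths share "branch_1"; only the short connections differ.
to_ff1 = ["branch_1", "conn_1"]
to_ff2 = ["branch_1", "conn_2"]

nominal = {"branch_1": 500, "conn_1": 20, "conn_2": 25}
# Hypothetical crosstalk or PVT shift on the shared branch only.
disturbed = {"branch_1": 650, "conn_1": 20, "conn_2": 25}

print(pair_skew(to_ff1, to_ff2, nominal))
print(pair_skew(to_ff1, to_ff2, disturbed))
```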

Factor 1: This clock sub-net will likely have a very small clock skew, because FF_1 and FF_2 are located next
to each other; connections 1 and 2 have different routing, but almost the same length and delay. The whole clock
tree may have a non-zero clock skew (as it usually has with the current clock synthesis), but this skew will be
between different partition groups. This means that the skew is between flip-flops that do not talk to each other.
Thus, even a relatively large skew is acceptable.

Factor 2: Connections 1 and 2 are parts of the same clock sub-net and thus have the same aggressor nets.
Therefore, the crosstalk-injected delays in connections 1 and 2 are practically the same; they will not increase
clock skew and will not introduce clock jitter.

Factor 3: Connections 1 and 2 go through the same chip area and are driven by the same clock driver. Thus, PVT
conditions in connections 1 and 2 will be the same. Therefore, the ends of connections 1 and 2 will show
practically the same delay variation over PVT, and this will not increase the clock skew.

Factor 4: If delay D(P) is small, then an almost zero delay difference between connections 1 and 2 is the best
solution (Fig. 8, 9). If delay D(P) is large, it may be useful (these cases will be discussed later) to have more
delay in connection 2 than in connection 1. This solution is shown in Fig. 10.

Fig. 10. Optimal clock sub-net with additional clock buffers.

Note: If it is useful to introduce an additional delay into connection 2, then the objective of placing FF_2 close
to FF_1 can be relaxed, and additional clock buffers may be placed in areas other than those of FF_1 or
FF_2. See a solution in Fig. 11. In this case factors F1-F4 are not so important, because we want
some non-zero skew between FF_1 and FF_2 anyway. The clock buffer should still be placed as close as possible to
FF_1.




[Fig. 11 drawing: clock sub-net on the chip with additional clock buffers]

Fig. 11. Optimal placement and clock sub-net with additional clock buffers.



Fig. 12 shows a simplified flowchart for the newly developed clock-driven placement and clock synthesis. We start
with the simplest case, where the goal is zero clock skew (F1) together with the other factors F2-F4, and we do not
introduce "useful" clock skew. This approach takes into account the minimization (optimization) of all four
factors F1-F4. It knows about timing paths (especially critical ones) between the flip-flops and makes use of this
information. The cell placer not only minimizes critical paths, but also minimizes the size of the area containing
each partition group, which contains connected timing critical paths.

Fig. 12. New clock synthesis simplified flow for zero skew goal



Brief description of the blocks.

Block 0. Start point. No cell placement. No information on critical paths. Thus we have no partitioning of the 
critical paths. 

Block 1. This is a cell placement where, in addition to timing optimization, we try to place the flip-flops in each
partition group as close to each other as possible. An example of an initial placement (first iteration) for one
group is shown in Fig. 5. After several iterations, a possible result is shown in Fig. 13.

Fig. 13. Optimal placement of partition group

Block 2. Run static timing analysis to find all timing critical paths, and hence the start and end flip-flops of
these paths.



Block 3. Partition all flip-flops in critical paths. If any two critical paths have a common flip-flop, then all
flip-flops of these paths are combined in a group. As a result:

- Each critical path belongs to some group; 

- Each group contains only connected critical paths; and 

- If two critical paths are connected they belong to the same group. 
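
One possible way to compute the partition of Block 3 is a union-find (disjoint-set) pass over the critical paths, sketched below. Union-find is our choice for this illustration and is not prescribed by the flow; each critical path is represented only by its start and end flip-flops, and the function name is hypothetical.

```python
def partition_flip_flops(critical_paths):
    """Group flip-flops that are linked by critical paths (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    # Two flip-flops joined by a critical path end up in the same group.
    for start, end in critical_paths:
        parent[find(start)] = find(end)

    groups = {}
    for ff in list(parent):
        groups.setdefault(find(ff), set()).add(ff)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical critical paths given as (start flip-flop, end flip-flop).
paths = [("FF_1", "FF_2"), ("FF_2", "FF_3"), ("FF_4", "FF_5")]
print(partition_flip_flops(paths))
```

Paths that share a flip-flop (here FF_2) fall into one group, which matches the three properties listed above.
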
Block 4. The quality Q of the current cell placement is a weighted sum of a function DP_max that shows how
good the design timing is (it is equal to the maximum path delay DP=D(P)) and a function LG_max that shows
how good the design is in terms of F1-F4. Namely, LG_max is the maximum distance between any two flip-flops
within a partition group. Thus, if DP_max is small (good timing) but there is at least one group that has two
flip-flops FF_1 and FF_2 far away from each other, then Q has a large value (quality is low).
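
The quality function of Block 4 can be sketched as Q = w_timing * DP_max + w_group * LG_max. The weights, the use of Manhattan distance, and all names below are assumptions made for the illustration; the text does not fix the distance metric or the weights, and the weights also have to reconcile the different units (delay and distance).

```python
from itertools import combinations


def manhattan(a, b):
    """ASSUMPTION: distance between two placed cells is Manhattan distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def lg_max(groups, positions):
    """Maximum distance between any two flip-flops that share a partition group."""
    return max((manhattan(positions[a], positions[b])
                for g in groups for a, b in combinations(g, 2)), default=0)


def placement_quality(path_delays, groups, positions, w_timing=1.0, w_group=1.0):
    """Weighted sum of DP_max and LG_max; a smaller value means a better placement."""
    dp_max = max(path_delays)
    return w_timing * dp_max + w_group * lg_max(groups, positions)


# Hypothetical data: same timing, flip-flops of one group far apart vs. close.
groups = [["FF_1", "FF_2"]]
far = {"FF_1": (0, 0), "FF_2": (3000, 4000)}
near = {"FF_1": (0, 0), "FF_2": (100, 100)}
print(placement_quality([900, 700], groups, far))
print(placement_quality([900, 700], groups, near))
```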

Block 5. If quality Q is less than some specified value (or it was not improved during the last iteration), then we
finish the placement iterations and transfer control to Block 6. Otherwise, we continue placement improvement.
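
The stopping rule of Block 5 can be sketched as a simple loop. The callbacks improve_step and evaluate_quality stand in for the placer and for the quality function Q; both are hypothetical placeholders, and the max_iterations guard is our own addition rather than part of the described flow.

```python
def iterate_placement(improve_step, evaluate_quality, placement, q_target,
                      max_iterations=100):
    """Repeat placement improvement until Q is good enough or stops improving."""
    q = evaluate_quality(placement)
    for _ in range(max_iterations):      # safety guard (assumption, not in the flow)
        if q < q_target:                 # Q is below the specified value: done
            break
        candidate = improve_step(placement)
        q_new = evaluate_quality(candidate)
        if q_new >= q:                   # no improvement during the last iteration
            break
        placement, q = candidate, q_new
    return placement, q


# Toy stand-ins: the "placement" is a number and each step halves it.
print(iterate_placement(lambda p: p // 2, lambda p: p, 64, q_target=10))
# A step that does not improve anything stops after one attempt.
print(iterate_placement(lambda p: p, lambda p: p, 64, q_target=10))
```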

Block 6. In each group we add one clock buffer. The strength of this clock buffer is determined by the maximum
distance between two flip-flops within the group. If this distance is small, then a weak clock driver can be
selected. If the distance is larger, then a stronger buffer should be used. The clock buffer is placed midway
between the two flip-flops within the group that have the maximum distance between them. Then the buffer
is connected to each flip-flop in the group. Thus, a clock sub-net is formed. See the example in Fig. 14 for Fig. 13.
Note that this solution, in which we use a stronger clock buffer and longer connections instead of inserting an
additional clock buffer (the current solution), is preferable because it is better in the sense of factors F1-F4.
Namely, clock skew is easier to minimize in connections alone than in buffers and connections. Connections have a
smaller dependence on PVT variation than cells. Crosstalk will be about the same, because it is the same clock
sub-net. And, finally, if the distances are reasonably long, connections have a smaller delay than clock buffers.

Fig. 14. Clock buffer selection, placement, and connecting
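
The buffer selection and placement of Block 6 can be sketched as follows for one partition group. The distance thresholds and strength names in limits, the Manhattan metric, and the function names are hypothetical; real values would come from the cell library and the technology.

```python
from itertools import combinations


def manhattan(a, b):
    """ASSUMPTION: distance between two placed cells is Manhattan distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def place_group_buffer(positions, limits=((200, "weak"), (1000, "medium"))):
    """Choose location and strength of the clock buffer for one group (>= 2 FFs)."""
    # Find the two flip-flops of the group that are farthest apart.
    pair = max(combinations(sorted(positions), 2),
               key=lambda p: manhattan(positions[p[0]], positions[p[1]]))
    (xa, ya), (xb, yb) = positions[pair[0]], positions[pair[1]]
    distance = manhattan((xa, ya), (xb, yb))
    # The buffer goes midway between these two flip-flops.
    midpoint = ((xa + xb) / 2, (ya + yb) / 2)
    # A larger maximum distance calls for a stronger buffer (hypothetical limits).
    strength = "strong"
    for limit, name in limits:
        if distance <= limit:
            strength = name
            break
    return midpoint, strength, distance


# Hypothetical group of three flip-flops (coordinates in microns).
group = {"FF_1": (0, 0), "FF_2": (100, 0), "FF_3": (0, 60)}
print(place_group_buffer(group))
```

Every flip-flop of the group is then connected to the buffer at the returned location, which forms the clock sub-net.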



Blocks 7-11 are traditional blocks in the current design flow.

Finally, let us consider the modifications needed when we want to introduce an additional ("useful") delay in
some clock sub-nets on purpose. The flowchart is still the same (Fig. 12), but the following blocks should
be modified.

Block 3. Partitioning all flip-flops. In this case we partition not only the FFs in critical paths, but also some
non-critical FFs. Let us explain this with the example in Fig. 15.

[Fig. 15 drawing: FF_1, FF_2, and FF_3 with timing critical path P1 and timing non-critical path P2]

Fig. 15. A partition group with one timing critical and one non-critical paths



A known solution is to introduce an additional useful delay to FF_2 and to have minimum skew and delay for FF_1
and FF_3. It is not difficult to implement an additional delay to FF_2, but for this placement (Fig. 15) it is very
difficult to have zero skew between FF_1 and FF_3 and also a small clock delay from the clock source to both of
them, because they are located far away from each other. We suggest creating a partition group that contains
only FF_1 and FF_2. Then, after several placement iterations, we will obtain the solution shown in Fig. 16.

Fig. 16. An optimal placement of partition group with one timing critical and one non-critical paths

Block 6. In each group we add one clock buffer and possibly some additional ("useful") clock buffers. The
strength of this clock buffer is determined by the maximum distance between two flip-flops within the group. If
this distance is small, then a weak clock driver can be selected. If the distance is larger, then a stronger buffer
should be used. The clock buffer is placed midway between the two flip-flops within the group that have the
maximum distance between them. Then the buffer is connected to each flip-flop in the group. Thus, a clock sub-net
including the group flip-flops is formed. See the example in Fig. 17 for Fig. 16.

Fig. 17. Group clock buffer and group clock sub-net

Finally, we insert additional clock buffers between the main clock buffer and FF_2. See Fig. 18. One can see that
data going through the timing critical path P(FF_1,FF_2) will be triggered with an additional delay, which allows
handling a very large path delay. At the same time, the connections from the clock buffer to FF_1 and FF_3 are
short, with close to zero clock skew. Thus, data going through the non-timing-critical path P(FF_2,FF_3) will be
triggered with close to zero delay, and the time taken from the shortened clock period for FF_3 is used to extend
the clock period for FF_2.
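
The time borrowing described above can be sketched numerically. If the clock of FF_2 is delayed by d, the critical path P(FF_1,FF_2) gets an effective period of T + d, while the following path P(FF_2,FF_3) gets T - d. The sketch below computes the range of d that satisfies both setup requirements; it deliberately ignores hold checks and clock-to-output delay, and the function name and all numbers (in ps) are hypothetical.

```python
def useful_delay_window(period, d_critical, d_next, t_setup):
    """Range of extra clock delay at the middle flip-flop that meets both setups."""
    d_min = max(0, d_critical + t_setup - period)  # needed by the critical path
    d_max = period - d_next - t_setup              # allowed by the following path
    return (d_min, d_max) if d_min <= d_max else None


# Critical path of 1100 ps followed by a 400 ps path, 1000 ps clock period.
print(useful_delay_window(period=1000, d_critical=1100, d_next=400, t_setup=50))
# If the following path is also long, no useful delay exists.
print(useful_delay_window(period=1000, d_critical=1100, d_next=900, t_setup=50))
```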



LSI Logic - Confidential, Proprietary & Attorney-Client Privileged
(Mail to "Intellectual Property Law Dept. Attn: New Invention Disclosures Paralegal" - M/SAD-106)

Appendix to Invention Disclosure Form
Last Modified: 5/10/03
Docket Number: 03-0000




[Fig. 18 drawing: group clock buffer on the chip with additional clock buffers to FF_2 and timing non-critical
path P2]

Fig. 18. Group clock buffer and additional buffers to FF_2.



We showed that the new clock synthesis and flow lead to a small (or useful) clock skew. Also, the important
factors F1-F4 are taken into account during timing closure of the design without applying more pessimistic
delays on paths. This improves design performance. It also makes timing closure easier to achieve and minimizes
design iterations. Finally, it decreases the risk of failure in silicon due to PVT variation or crosstalk effects.